Psychometric Measurement of Forecasters Using the Wisdom of Crowds

Quant Brownbag 2026-01-22

Jessica Helmer, Sophie Ma Zhu, Nikolay Petrov, Ezra Karger, Mark Himmelstein

Forecasting

Predictions about future events

Has been researched through large forecasting tournaments

Forecasting ability is relatively consistent: past performance predicts future performance

Some people are better than others at forecasting

○ Superforecasters

Want to be able to identify highly skilled forecasters so decision makers can have access to the best possible information

Measuring Forecasting Skill

Forecasting skill is typically assessed as the squared distance between a forecaster’s forecasts and some scoring criterion, averaged over many items

\[ X_n = \frac{1}{K} \sum_{k=1}^K (F_{nk} - C_k)^2 \]

\(k\) indexing \(K\) items,
\(F_{nk}\) representing forecaster \(n\)’s forecast on item \(k\), and
\(C_k\) representing a scoring criterion for item \(k\)
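A minimal sketch of this score in code (the function name, data, and values are illustrative, not from the study):

```python
import numpy as np

def mean_squared_score(forecasts, criterion):
    """Mean squared distance between one forecaster's K forecasts and the
    scoring criterion for each of the K items (lower = better)."""
    forecasts = np.asarray(forecasts, dtype=float)
    criterion = np.asarray(criterion, dtype=float)
    return np.mean((forecasts - criterion) ** 2)

# Hypothetical example with K = 3 items
f_n = [0.7, 0.2, 0.9]   # forecaster n's forecasts F_nk
c_k = [1.0, 0.0, 1.0]   # scoring criterion C_k
print(mean_squared_score(f_n, c_k))   # 0.0467
```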

Measuring Forecasting Skill

\[ X_n = \frac{1}{K} \sum_{k=1}^K (F_{nk} - C_k)^2 \]

In typical testing, \(C_k\) would be fixed and known

○ What’s the capital of Virginia?

In forecasting, item outcome not known at time of testing

○ What will the population of Atlanta be in 2050?

“Correct answer” more like a random variable

Measuring Forecasting Skill

“Correct answer” more like a random variable

E.g., for a hypothetical item \(k\) whose outcome is normally distributed with an unknown mean and standard deviation, we can represent this as \[ O_k \sim \mathcal{N}(\mu_k, \sigma_k) \]

An optimal point forecast for this item \(F_k\) would be \(\mu_k\), so we would ideally score \(F_k\) by comparing it to \(\mu_k\)

\[ X_n^{\mu_k} = \frac{1}{K} \sum_{k=1}^K (F_{nk} - \mu_k)^2 \]

Measuring Forecasting Skill

But, since \(\mu_k\) is rarely known, we typically use the outcome \(O_k\) as the scoring criterion instead.

\[X_n^{O_k} = \frac{1}{K} \sum_{k=1}^K (F_{nk} - O_k)^2 \text{ where } O_k \sim \mathcal{N}(\mu_k, \sigma_k)\]

The measurement error becomes clear if we rewrite this as

\[ X_n^{O_k} = \frac{1}{K} \sum_{k=1}^K (F_{nk} - (\mu_k + \epsilon_{O_k}))^2 \]

where \(\epsilon_{O_k} = O_k - \mu_k \sim \mathcal{N}(0, \sigma_k)\); ideally, we would want \(\epsilon_{O_k}\) to be as small as possible.
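To spell out why this error matters (a standard decomposition, added here for completeness rather than taken from the slides): since \(\epsilon_{O_k}\) has mean zero and variance \(\sigma_k^2\) and is independent of the forecast,

\[ \mathbb{E}\!\left[(F_{nk} - O_k)^2\right] = (F_{nk} - \mu_k)^2 + \sigma_k^2, \]

so scoring against outcomes inflates every forecaster's expected score by the irreducible item noise \(\sigma_k^2\), regardless of skill.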

Intersubjective Measures

Using peer comparisons to score forecasts instead of using outcomes

Wisdom of crowds: when many predictions are aggregated, individual errors tend to cancel out.

○ Aggregate predictions tend to be more accurate than the average prediction within a crowd

Crowd Aggregates as Scoring Criterion

Wisdom of crowds assumes that the distribution of forecasts for an item has the same mean as the outcome distribution \(\mu_{F_k} = \mu_{k}\).

\[ F_k \sim \mathcal{N} (\mu_k, \sigma_{F_k}) \]

Under the Central Limit Theorem, the average of many forecasts will converge on the mean of the outcome distribution, with standard error inversely proportional to the square root of the number of forecasters \(N\)

\[ A_k \sim \mathcal{N} (\mu_k, \frac{\sigma_{F_k}}{\sqrt{N}}) \]

Crowd Aggregates as Scoring Criterion

Substituting the crowd aggregate \(A_k\) for the scoring criterion,

\[X_n^{A_k} = \frac{1}{K} \sum_{k=1}^K (F_{nk} - A_k)^2 \text{ where } A_k \sim \mathcal{N} (\mu_k, \frac{\sigma_{F_k}}{\sqrt{N}}) \]

\[ X_n^{A_k} = \frac{1}{K} \sum_{k=1}^K (F_{nk} - (\mu_k + \epsilon_{A_k}))^2 \]

As long as a crowd is unbiased (\(\mu_{F_k}=\mu_k\)) and sufficiently large \(\left(\frac{\sigma_{F_k}}{\sqrt N}<\sigma_k\right)\), on average \(\epsilon_{A_k}^2<\epsilon_{O_k}^2\) and the wisdom of that crowd \(A_k\) will be a better representation of the optimal point forecast \(\mu_k\) than \(O_k\).
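A quick numerical check of this claim (parameter values are hypothetical; nothing here uses study data):

```python
import numpy as np

rng = np.random.default_rng(0)

mu_k, sigma_k = 5.0, 2.0      # hypothetical item: outcome O_k ~ N(mu_k, sigma_k)
sigma_F, N    = 3.0, 50       # crowd: individual forecasts F_k ~ N(mu_k, sigma_F)
reps = 100_000

# Error of the two candidate scoring criteria relative to mu_k
eps_O = rng.normal(mu_k, sigma_k, reps) - mu_k                          # outcome
eps_A = rng.normal(mu_k, sigma_F, size=(reps, N)).mean(axis=1) - mu_k   # crowd aggregate

print(np.mean(eps_O ** 2))   # ~ sigma_k^2     = 4.00
print(np.mean(eps_A ** 2))   # ~ sigma_F^2 / N = 0.18
```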


Simulation

\(N\) = 1,000 forecasters

\(\theta_n \sim \mathcal{N}(0, 1)\)
○ Varying skill levels

\(K\) = 1,000 items

\(O_k \sim \mathcal{N}(0, \sigma_k)\)
○ Specified item “noisiness”


Generated forecasts for each forecaster on each item (sketched in code after this list)

\(\theta_n\) defined the expected distance between an item’s expected outcome and a forecaster’s forecast

○ Mean of forecaster’s forecast distribution \(\mu_{nk} \sim \mathcal{N}(0, e^{\theta_n})\)
○ Forecaster’s forecast distribution on that item \(f_{nk} \sim \mathcal{N}(\mu_{nk}, \sigma_k)\)
○ Drew [.05, .25, .50, .75, .95] quantiles for each forecast
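A rough code sketch of this generative process (all variable names and the seed are mine; the study reports no code here):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2026)

N_f, K  = 1_000, 1_000                  # forecasters, items
sigma_k = 1.0                           # item "noisiness" (also run with 2.0)

theta    = rng.normal(0, 1, N_f)        # skill parameter theta_n
outcomes = rng.normal(0, sigma_k, K)    # O_k ~ N(0, sigma_k)

# Mean of forecaster n's forecast distribution on item k:
# mu_nk ~ N(0, e^{theta_n}); smaller theta_n keeps forecasts nearer the true item mean of 0
mu_nk = rng.normal(0.0, np.exp(theta)[:, None], size=(N_f, K))

# Submitted forecasts are the [.05, .25, .50, .75, .95] quantiles of
# f_nk ~ N(mu_nk, sigma_k); the median (point forecast) equals mu_nk
qs = np.array([.05, .25, .50, .75, .95])
F_quantiles = norm.ppf(qs[:, None, None], loc=mu_nk, scale=sigma_k)   # shape (5, N_f, K)
```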

Simulation

Sampled combinations of:

\(N \in\) [4, 8, 16, 32] forecasters and
\(K \in\) [4, 8, 16, 32] items for
\(\sigma_k \in\) [1, 2]

Scored each forecast with:

○ ground truth
○ intersubjective scoring (squared distance from aggregate forecast of group of size \(N\))

Which scoring measure captures forecasters’ skill better?
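Continuing the hypothetical sketch above, one sampled condition might be scored like this (here the crowd aggregate includes the forecaster's own forecast; the study may have handled that differently):

```python
# Take a small crowd of n forecasters and a subset of k items
n, k  = 16, 16
idx_f = rng.choice(N_f, n, replace=False)
idx_k = rng.choice(K, k, replace=False)

point = mu_nk[np.ix_(idx_f, idx_k)]    # median/point forecasts, shape (n, k)
truth = outcomes[idx_k]                # realized outcomes for the sampled items
agg   = point.mean(axis=0)             # crowd aggregate per item

ground_truth_score    = ((point - truth) ** 2).mean(axis=1)   # one score per forecaster
intersubjective_score = ((point - agg) ** 2).mean(axis=1)

# How well does each score recover the original skill parameter theta_n?
skill = theta[idx_f]
print(np.corrcoef(skill, ground_truth_score)[0, 1])
print(np.corrcoef(skill, intersubjective_score)[0, 1])
```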

Simulation: \(\sigma = 1\)

Intersubjective scoring correlation increases with \(N\)

For \(N\) ≥ 8, intersubjective scoring captures the original skill parameter better than ground-truth scoring

Simulation: \(\sigma = 2\)

Increased variance affects ground-truth scoring but not intersubjective scoring

Intersubjective scoring has stronger correlations with skill at lower \(N\) and \(K\) combinations than before

Intersubjective Measures

Types of intersubjective measures

Proxy Scoring

○ Scoring by a forecast’s squared distance from the aggregate forecast

Metapredictions

○ Explicit predictions about crowd aggregates
○ What would the average person predict the population of Atlanta will be in 2050?

If a forecaster correctly identified that \(\mu_k \neq \mu_{F_k}\) (i.e., bias in the crowd), proxy scoring would incorrectly penalize the forecaster

Metapredictions explicitly ask about beliefs of \(\mu_{F_k}\) but also require more effort

Intersubjective Measures: Existing Work

Originally tested for their real-time scoring ability (before outcomes are known), but surprisingly successful at predicting forecasting accuracy

○ More reliable than ground-truth scoring

But only explored with categorical forecasting items

Categorical Item
○ Ground-truth outcome: Price of gas was between $3.50 and $4.00
○ Intersubjective outcome: Average probability judgement about the price of gas being between $3.50 and $4.00 was 52%

Continuous Item
○ Ground-truth outcome: Price of gas was $3.78
○ Intersubjective outcome: Average prediction for price of gas was $3.80

Measurement scale (categorical vs. continuous) may confound the reliability boost from intersubjective measures

Present Study

  1. Are intersubjective measures still good predictors of forecasting accuracy with continuous items?
  2. Are proxy scores or metapredictions stronger predictors of forecasting accuracy?

Superforecasters

Previously identified skilled forecasters

The aggregate of these superforecasters may serve as a better reference criterion than the aggregate of the general crowd

Proxy scoring with a superforecaster crowd aggregate has been an effective method

○ But has only been implemented between-subjects

Present Study

  1. Are intersubjective measures still good predictors of forecasting accuracy in the Quantile Elicitation Format (QEF), which offers superior reliability?

  2. Are proxy scores or metapredictions stronger predictors of forecasting accuracy?

  3. Are superforecaster aggregates a better reference criterion than general crowd aggregates?

Methods

Final wave of a longitudinal forecasting study (N = 894)

Forecasts on six items in the QEF

Metapredictions after each item:

○ What do you think the average person would predict?
○ What do you think a superforecaster would predict?

Additional sample of N = 42 superforecasters

Scoring Methods

Scored each forecast with:

○ Ground Truth: own forecast’s distance from the actual outcome
○ Proxy scoring: own forecast’s distance from a crowd aggregate (Proxy Crowd and Proxy Super)
○ Metaprediction accuracy: metaprediction’s distance from a crowd aggregate (Metaprediction Crowd and Metaprediction Super)

Compared these scores to forecasting accuracy on a separate set of thirty items
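A schematic of how the five scores could be computed for one forecaster (array layout and names are assumptions for illustration, not the study's code):

```python
import numpy as np

def scores_for_forecaster(forecast, meta_crowd, meta_super,
                          outcome, crowd_agg, super_agg):
    """Each argument is an array of length K (one value per item);
    every score is a mean squared distance over items (lower = better)."""
    msd = lambda a, b: float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))
    return {
        "ground_truth":         msd(forecast,   outcome),    # own forecast vs. actual outcome
        "proxy_crowd":          msd(forecast,   crowd_agg),  # own forecast vs. crowd aggregate
        "proxy_super":          msd(forecast,   super_agg),  # own forecast vs. superforecaster aggregate
        "metaprediction_crowd": msd(meta_crowd, crowd_agg),  # "average person" guess vs. crowd aggregate
        "metaprediction_super": msd(meta_super, super_agg),  # "superforecaster" guess vs. super aggregate
    }
```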

Aggregate Accuracy

Ground-truth score distributions

Superforecaster aggregate more often more accurate than the general crowd aggregate

Correlations

Superforecaster metapredictions and proxy scores show the strongest correlations with forecasting accuracy

Contributions to Forecasting Proficiency

How much variance in forecaster accuracy does each score explain?

Mixed model with random effects for item and person; conducted a dominance analysis

Score                    Contribution   Proportion
Proxy Super                  .119           .225
Metaprediction Super         .118           .223
Proxy Crowd                  .109           .206
Metaprediction Crowd         .094           .178
Ground Truth                 .089           .168
\(R_{forecaster}^2\)          .53           1.00
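A simplified illustration of the dominance-analysis idea (a flat linear regression sketch; the actual analysis included random effects for item and person, and the variable names in the usage comment are hypothetical):

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression

def general_dominance(X, y, names):
    """General dominance weights: each predictor's incremental R^2, averaged first
    within and then across subset sizes of the other predictors.
    The weights sum to the full model's R^2."""
    def r2(cols):
        return LinearRegression().fit(X[:, cols], y).score(X[:, cols], y) if cols else 0.0

    p = X.shape[1]
    weights = {}
    for j, name in enumerate(names):
        others = [i for i in range(p) if i != j]
        size_means = []
        for size in range(p):   # subset sizes 0 .. p-1
            incs = [r2(list(s) + [j]) - r2(list(s)) for s in combinations(others, size)]
            size_means.append(np.mean(incs))
        weights[name] = float(np.mean(size_means))
    return weights

# Hypothetical usage: rows of X are forecasters, columns are the five scores,
# and y is accuracy on the separate thirty-item set.
# general_dominance(X, y, ["proxy_super", "meta_super", "proxy_crowd",
#                          "meta_crowd", "ground_truth"])
```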

Select Crowds

Proxy scores effective way of finding select crowds to aggregate
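One way this selection could look in code (a hedged sketch; top_k and the data layout are assumptions, not the study's procedure):

```python
import numpy as np

def select_crowd(forecasts, crowd_agg, top_k=10):
    """Rank forecasters by proxy score (mean squared distance from the full-crowd
    aggregate across items) and return the aggregate of the top_k closest.
    forecasts: shape (N, K); crowd_agg: shape (K,)."""
    proxy = np.mean((forecasts - crowd_agg) ** 2, axis=1)   # lower = closer to crowd
    best  = np.argsort(proxy)[:top_k]
    return forecasts[best].mean(axis=0), best
```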

Discussion

Intersubjective measures still effective measure of forecasting ability

○ More reliable picture of the truth
○ Less influenced by unpredictability in items
○ Superforecaster aggregates particularly useful

Can reduce measurement error in talent spotting by modifying the scoring criterion

References

Atanasov, P., & Himmelstein, M. (2023). Talent spotting in crowd prediction. In M. Seifert (Ed.), Judgment in Predictive Analytics. Springer.

Atanasov, P., Rescober, P., Stone, E., Swift, S. A., Servan-Schreiber, E., Tetlock, P., Ungar, L., & Mellers, B. (2017). Distilling the Wisdom of Crowds: Prediction Markets vs. Prediction Polls. Management Science, 63(3), 691–706. https://doi.org/10.1287/mnsc.2015.2374

Galton, F. (1907). Vox Populi. Nature, 75(1949), 450–451. https://doi.org/10.1038/075450a0

Himmelstein, M., Budescu, D. V., & Ho, E. H. (2023). The wisdom of many in few: Finding individuals who are as wise as the crowd. Journal of Experimental Psychology: General, 152(5), 1223–1244. https://doi.org/10.1037/xge0001340

Himmelstein, M., Zhu, S. M., Petrov, N., Karger, E., Helmer, J., Livnat, S., Bennett, A., Hedley, P., & Tetlock, P. (2025). The Forecasting Proficiency Test: A General Use Assessment of Forecasting Ability. OSF. https://doi.org/10.31234/osf.io/a7kdx

Karger, E., Monrad, J., Mellers, B., & Tetlock, P. (2021). Reciprocal Scoring: A Method for Forecasting Unanswerable Questions. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.3954498

Tetlock, P. E., & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown.

Wilkening, T., Martinie, M., & Howe, P. D. L. (2022). Hidden Experts in the Crowd: Using Meta-Predictions to Leverage Expertise in Single-Question Prediction Problems. Management Science, 68(1), 487–508. https://doi.org/10.1287/mnsc.2020.3919

Zhu, S. M., Budescu, D. V., Petrov, N., Karger, E., & Himmelstein, M. (2024). The psychometric properties of probability and quantile forecasts. Preprint.

Thank you!

Questions?
